
Refactor code #72

Merged
tastelikefeet merged 6 commits into modelscope:main from tastelikefeet:feat/refactor-0224
Feb 24, 2026

Conversation

@tastelikefeet
Collaborator

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

Write the detailed information belonging to this PR.

Experiment results

Paste your experiment results here (if needed).

@gemini-code-assist
Contributor

Summary of Changes

Hello @tastelikefeet, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily focuses on refactoring core utility modules, especially those related to platform detection and distributed computing. The changes enhance code organization, improve maintainability, and address a critical bug in GPU resource allocation. By moving platform-specific logic into a dedicated package and refining type hints, the codebase becomes more robust and easier to extend. Additionally, temporary helper functions were removed, cleaning up the API.

Highlights

  • Platform Utilities Refactoring: The core platform detection and utility functions, previously in twinkle/utils/platform.py, have been refactored into a new twinkle/utils/platforms package for better modularity and organization. This includes separating base platform logic, GPU, NPU, and MPS specific implementations.
  • GPU Allocation Logic Fix: A bug in the resource manager's GPU allocation logic was addressed, specifically correcting how gpu_ranks_local are determined to ensure GPUs for a worker are correctly assigned from the visible_devices list based on node rank.
  • Deferred Type Hinting for Torch: Type hints for torch.Size and torch.dtype were updated in checkpoint_engine/base.py to use string literals and TYPE_CHECKING to avoid circular imports and allow for deferred evaluation, improving type checking robustness.
  • Removal of Temporary Advantage Helpers: Temporary helper functions compute_advantages and compute_advantages_rloo were removed from twinkle/advantage/__init__.py, as they were marked for removal once a proper advantage computation API was available.
  • Relocation of Stateless Process Group Initialization: The stateless_init_process_group function was moved from twinkle/utils/network.py to twinkle/utils/torch_utils.py to consolidate PyTorch-related utilities.
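The deferred type-hinting change highlighted above can be sketched as follows. The class and attribute names here are hypothetical illustrations, not the actual twinkle API; the point is that string-literal annotations plus a TYPE_CHECKING guard keep `torch` out of the runtime import path:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Imported only by static type checkers; never executed at runtime,
    # which avoids circular imports and keeps module import time low.
    import torch


class ShardMeta:
    """Hypothetical example class; not part of the twinkle API."""

    def __init__(self, shape: 'torch.Size', dtype: 'torch.dtype') -> None:
        # String-literal annotations defer evaluation, so this module
        # imports cleanly even when torch is unavailable at runtime.
        self.shape = shape
        self.dtype = dtype
```

Because the annotations are plain strings at runtime, nothing here requires torch to be installed unless a type checker is running.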


Changelog
  • src/twinkle/advantage/__init__.py
    • Removed temporary compute_advantages and compute_advantages_rloo helper functions and their exports from __all__.
  • src/twinkle/checkpoint_engine/base.py
    • Added TYPE_CHECKING import for conditional torch import.
    • Updated torch.Size and torch.dtype type hints to string literals for deferred evaluation.
  • src/twinkle/checkpoint_engine/hccl_checkpoint_engine.py
    • Updated import path for find_free_port, find_node_ip, is_valid_ipv6_address, and stateless_init_process_group from twinkle.utils.network to twinkle.utils.
  • src/twinkle/infra/_ray/resource_manager.py
    • Removed an assertion len(groups) == len(visible_devices).
    • Corrected the logic for determining gpu_ranks_local to properly index into self.visible_devices using node_rank.
  • src/twinkle/model/base.py
    • Updated import path for ensure_hccl_socket_env from twinkle.utils.network to twinkle.utils.platforms.
  • src/twinkle/model/megatron/megatron.py
    • Updated import path for is_last_rank from twinkle.utils.platform to twinkle.utils.
  • src/twinkle/model/megatron/model/gpt_bridge.py
    • Updated import path for is_last_rank from twinkle.utils.platform to twinkle.utils.
  • src/twinkle/sampler/vllm_sampler/vllm_engine.py
    • Updated import path for Platform from twinkle.utils.platform to twinkle.utils.
    • Modified get_vllm_device_uuid call to use Platform.get_vllm_device_uuid.
  • src/twinkle/sampler/vllm_sampler/vllm_sampler.py
    • Updated import path for Platform from twinkle.utils.platform to twinkle.utils.
  • src/twinkle/sampler/vllm_sampler/vllm_worker_extension.py
    • Updated import path for Platform from twinkle.utils.platform to twinkle.utils.
    • Modified get_vllm_device_uuid call to use Platform.get_vllm_device_uuid.
  • src/twinkle/utils/__init__.py
    • Added imports for DeviceGroup, DeviceMesh, is_last_rank, is_master from device_mesh.
    • Added is_valid_ipv6_address from network.
    • Updated imports for GPU, NPU, Platform, ensure_hccl_socket_env, ensure_npu_backend from the new platforms package.
    • Added imports for pad_sequence_to_length, selective_log_softmax, stateless_init_process_group from torch_utils.
  • src/twinkle/utils/device_mesh.py
    • Renamed from src/twinkle/utils/platform.py.
    • Removed hashlib, platform, re, shutil, socket, subprocess, ABC, lru_cache imports.
    • Removed the Platform, GPU, NPU, MPS classes and related functions (_resolve_ascend_physical_device_id, _get_npu_bus_id_from_npu_smi, get_vllm_device_uuid).
    • Added import for Platform from platforms.
  • src/twinkle/utils/framework.py
    • Updated import path for DeviceMesh and Platform from platform to device_mesh.
  • src/twinkle/utils/logger.py
    • Updated import path for Platform from platform to platforms.
  • src/twinkle/utils/network.py
    • Removed os, torch, timedelta imports.
    • Removed _HCCL_IF_BASE_PORT_ENV, _HCCL_HOST_SOCKET_PORT_RANGE_ENV, _HCCL_NPU_SOCKET_PORT_RANGE_ENV constants.
    • Removed _derive_hccl_socket_env_defaults and _ensure_hccl_socket_env functions.
    • Removed stateless_init_process_group function.
  • src/twinkle/utils/platforms/__init__.py
    • Added new file to export Platform from base, GPU from gpu, MPS and is_mps_available from mps, and NPU, ensure_hccl_socket_env, ensure_npu_backend from npu.
  • src/twinkle/utils/platforms/base.py
    • Added new file defining the abstract Platform class with static methods for platform-agnostic operations and device UUID retrieval.
  • src/twinkle/utils/platforms/gpu.py
    • Added new file defining the GPU platform implementation, including visible_device_env, device_prefix, get_local_device, device_backend, and get_vllm_device_uuid.
  • src/twinkle/utils/platforms/mps.py
    • Added new file defining the MPS platform implementation, including is_mps_available, visible_device_env, device_prefix, get_local_device, device_backend, and get_vllm_device_uuid (raising NotImplementedError).
  • src/twinkle/utils/platforms/npu.py
    • Added new file defining the NPU platform implementation, including HCCL environment variable handling (_derive_hccl_socket_env_defaults, ensure_hccl_socket_env), NPU device ID resolution (_resolve_ascend_physical_device_id, _get_npu_bus_id_from_npu_smi), NPU backend assurance (ensure_npu_backend), and get_vllm_device_uuid with NPU-specific fallback logic.
  • src/twinkle/utils/safetensors.py
    • Updated import path for is_last_rank and is_master from platform to device_mesh.
  • src/twinkle/utils/torch_utils.py
    • Added imports for socket, timedelta, and is_valid_ipv6_address.
    • Moved the stateless_init_process_group function from twinkle/utils/network.py to this file.
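The platforms package layout described in this changelog can be approximated with an abstract base class. The attribute and method names below mirror those listed above (`visible_device_env`, `device_prefix`, `get_local_device`, `get_vllm_device_uuid`), but the bodies are illustrative sketches, not the real twinkle implementations:

```python
from abc import ABC, abstractmethod


class Platform(ABC):
    """Sketch of an abstract platform interface."""

    visible_device_env: str  # env var naming the visible devices
    device_prefix: str       # e.g. 'cuda' for GPU, 'npu' for Ascend

    @classmethod
    def get_local_device(cls, local_rank: int) -> str:
        # Shared, platform-agnostic logic lives on the base class.
        return f'{cls.device_prefix}:{local_rank}'

    @staticmethod
    @abstractmethod
    def get_vllm_device_uuid(device_id: int) -> str:
        ...


class GPU(Platform):
    visible_device_env = 'CUDA_VISIBLE_DEVICES'
    device_prefix = 'cuda'

    @staticmethod
    def get_vllm_device_uuid(device_id: int) -> str:
        # A real implementation would query the CUDA runtime;
        # stubbed here for illustration.
        return f'gpu-{device_id}'
```

Splitting each platform into its own module (gpu.py, npu.py, mps.py) behind a shared abstract base is what lets NPU-specific concerns like HCCL environment handling stay out of the GPU and MPS code paths.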

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request refactors the twinkle.utils.platform module into twinkle.utils.device_mesh and twinkle.utils.platforms to better organize platform-specific functionalities. It also removes temporary helper functions for advantage computation and updates imports across several files to reflect these changes. The changes improve code organization and maintainability by centralizing platform-related logic.

I was unable to create individual review comments, so my feedback is consolidated below.

src/twinkle/infra/_ray/resource_manager.py (198-199)

high

The logic for determining gpu_ranks_local has been refactored. Instead of directly indexing visible_device_list, it now iterates through worker_ranks and uses self.visible_devices[node_rank][r % nproc_per_node]. This change is necessary because self.visible_devices is now a list of lists (list of devices per node), whereas visible_device_list was likely a flattened list of all visible devices. This is a critical fix for correctly mapping GPU ranks to their local device IDs on specific nodes.

                        gpu_ranks_local = []
                        for r in worker_ranks:
                            node_rank = r // nproc_per_node
                            node_ranks.append(node_rank)
                            gpu_ranks = self.visible_devices[node_rank][r % nproc_per_node]
                            gpu_ranks_local.append(gpu_ranks)
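The effect of indexing by node rank can be seen in a small standalone sketch (the device IDs, node count, and ranks are made up for illustration):

```python
# visible_devices: one list of device IDs per node (2 nodes x 4 GPUs here),
# matching the list-of-lists structure described in the review comment.
visible_devices = [[0, 1, 2, 3], [0, 1, 2, 3]]
nproc_per_node = 4
worker_ranks = [4, 5]  # global ranks that live on node 1

gpu_ranks_local = []
for r in worker_ranks:
    node_rank = r // nproc_per_node  # ranks 4-7 map to node 1
    # Index first by node, then by the rank's slot within that node.
    gpu_ranks_local.append(visible_devices[node_rank][r % nproc_per_node])

# gpu_ranks_local == [0, 1]: local device slots on node 1,
# which a flattened visible_device_list would have gotten wrong.
```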

src/twinkle/infra/_ray/resource_manager.py (214)

high

Similar to the previous comment, the change from visible_device_list[alloc_rank % nproc_per_node] to self.visible_devices[node_rank][alloc_rank % nproc_per_node] is crucial for correctly indexing the visible devices based on the node rank. This ensures that the correct GPU is assigned to the worker, reflecting the new structure of self.visible_devices.

                        gpu_rank = self.visible_devices[node_rank][alloc_rank % nproc_per_node]

src/twinkle/utils/__init__.py (10)

medium

The import of GPU, NPU, Platform, ensure_hccl_socket_env, and ensure_npu_backend is now from .platforms. This is a significant refactoring, consolidating all platform-specific logic into a new platforms submodule. This improves the organization of the utils package by grouping related functionalities.

from .platforms import GPU, NPU, Platform, ensure_hccl_socket_env, ensure_npu_backend

src/twinkle/utils/__init__.py (12)

medium

The import of pad_sequence_to_length, selective_log_softmax, and stateless_init_process_group is now from .torch_utils. This indicates that stateless_init_process_group, previously in network.py, has been moved to torch_utils.py, suggesting it's more closely related to PyTorch utilities. This improves the logical grouping of functions.

from .torch_utils import pad_sequence_to_length, selective_log_softmax, stateless_init_process_group, to_device

src/twinkle/checkpoint_engine/base.py (102)

medium

Similar to the previous change, using 'torch.Tensor' as a string literal for type hinting is appropriate here, especially since torch is now conditionally imported for type checking. This maintains type correctness without requiring a runtime import of torch if it's not otherwise used in the module.

    async def send_weights(self, weights: Generator[tuple[str, 'torch.Tensor'], None, None]):

src/twinkle/checkpoint_engine/base.py (115)

medium

Consistent application of string literal type hints for torch.Tensor is good for the same reasons mentioned above.

    async def receive_weights(self) -> AsyncGenerator[tuple[str, 'torch.Tensor'], None]:

src/twinkle/infra/_ray/resource_manager.py (176)

medium

The removal of assert len(groups) == len(visible_devices) indicates a change in how visible_devices is expected to be used or initialized. This assertion might have been too restrictive or incorrect given the new logic. Ensure that the new logic correctly handles cases where len(groups) and len(visible_devices) might differ, or that visible_devices is now structured differently.

        for group in groups:

src/twinkle/checkpoint_engine/base.py (3)

medium

The torch import is moved inside a TYPE_CHECKING block. This is a good practice for type hinting imports that are not strictly needed at runtime, which can help reduce import times and avoid circular dependencies in some cases. However, ensure that torch is always imported at runtime if it's used outside of type hints in the module.

from abc import ABC, abstractmethod

src/twinkle/advantage/__init__.py (34-35)

medium

These entries in __all__ are no longer valid since the corresponding functions have been removed. Removing them ensures that the module's public API accurately reflects its contents.

]
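Why a stale `__all__` entry matters can be shown with a small simulated module; the helper names here are hypothetical stand-ins, not the actual twinkle exports:

```python
import types

# Simulate a module whose __all__ still lists a removed function.
mod = types.ModuleType('advantage')
mod.compute_grpo = lambda: 'ok'  # hypothetical surviving helper
mod.__all__ = ['compute_grpo', 'compute_advantages']  # stale entry

# `from advantage import *` iterates __all__, so any name in it that the
# module no longer defines raises AttributeError at import time.
missing = [name for name in mod.__all__ if not hasattr(mod, name)]
```

Pruning `__all__` alongside the removed functions keeps star-imports working and the public API honest.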

src/twinkle/model/megatron/model/gpt_bridge.py (21-22)

medium

The import of is_last_rank has been moved from twinkle.utils.platform to twinkle.utils. This aligns with the overall refactoring of platform-related utilities into a more centralized location within the twinkle.utils package. This change improves consistency and simplifies the import structure.

from twinkle.utils import (MxFp4Dequantizer, SafetensorLazyLoader, StreamingSafetensorSaver, deep_getattr, get_logger,
                           get_modules_to_not_convert, get_multimodal_target_regex, is_last_rank, requires)

src/twinkle/advantage/__init__.py (7-27)

medium

The removal of these temporary helper functions (compute_advantages and compute_advantages_rloo) is a good step towards cleaning up the codebase. It aligns with the TODO comment indicating they were temporary and not suitable for production. This improves the clarity and long-term maintainability of the advantage module by removing deprecated code.

src/twinkle/checkpoint_engine/base.py (11-12)

medium

Using string literals for type hints like 'torch.Size' and 'torch.dtype' when torch is only imported for type checking is the correct approach. This prevents a runtime import error if torch is not available or if it causes a circular dependency.

src/twinkle/utils/network.py (1-43)

medium

The removal of os, torch, datetime, and the HCCL-related environment variable definitions and functions (_HCCL_IF_BASE_PORT_ENV, _HCCL_HOST_SOCKET_PORT_RANGE_ENV, _HCCL_NPU_SOCKET_PORT_RANGE_ENV, _derive_hccl_socket_env_defaults, _ensure_hccl_socket_env) from network.py is a key part of the refactoring. These functionalities have been moved to the new platforms submodule, specifically npu.py, and stateless_init_process_group to torch_utils.py. This makes network.py more focused on generic network utilities.

import socket
from typing import Optional

src/twinkle/utils/network.py (90-170)

medium

The stateless_init_process_group function has been moved from network.py to torch_utils.py. This is a logical relocation as this function is specifically designed for PyTorch distributed process group initialization, even if it uses network concepts. This improves the thematic organization of the utils package.

src/twinkle/utils/platforms/__init__.py (1-4)

medium

This new __init__.py file in twinkle/utils/platforms serves to expose the Platform, GPU, MPS, NPU classes and related functions (is_mps_available, ensure_hccl_socket_env, ensure_npu_backend). This is a crucial part of the refactoring, creating a dedicated package for platform-specific abstractions and utilities. This improves modularity and makes the codebase easier to navigate and extend for different hardware platforms.

src/twinkle/utils/platforms/base.py (1-138)

medium

This new file base.py within twinkle/utils/platforms defines the abstract Platform class and its static methods. This centralizes common platform-agnostic functionalities and provides a clear interface for platform-specific implementations (GPU, NPU, MPS). This is a significant improvement in code organization and extensibility.

src/twinkle/utils/platforms/gpu.py (1-26)

medium

This new file gpu.py within twinkle/utils/platforms defines the GPU class, inheriting from Platform. It encapsulates GPU-specific environment variables, device prefixes, and device retrieval logic. This is a good example of adhering to the new modular structure for platform-specific implementations.

src/twinkle/utils/platforms/mps.py (1-42)

medium

This new file mps.py within twinkle/utils/platforms defines the MPS class and the is_mps_available function. It encapsulates MPS-specific environment variables, device prefixes, and device retrieval logic. This is a good example of adhering to the new modular structure for platform-specific implementations.

src/twinkle/utils/platforms/npu.py (1-135)

medium

This new file npu.py within twinkle/utils/platforms defines the NPU class and related helper functions for Ascend NPU. It centralizes HCCL environment variable handling, NPU device ID resolution, and the ensure_npu_backend function. This is a crucial part of the refactoring, moving all NPU-specific logic into a dedicated module and improving code organization.

src/twinkle/utils/torch_utils.py (112-194)

medium

The stateless_init_process_group function has been moved from network.py to torch_utils.py. This is a logical relocation as this function is specifically designed for PyTorch distributed process group initialization, even if it uses network concepts. This improves the thematic organization of the utils package. Additionally, the import for ensure_hccl_socket_env is now correctly pointing to twinkle.utils.platforms.

@tastelikefeet tastelikefeet merged commit 317f9b8 into modelscope:main Feb 24, 2026
1 of 3 checks passed
